Abstract: Spatio-temporal data is intrinsically high dimensional,
so unsupervised modeling is only feasible if we can exploit
structure in the process. When the dynamics are local in both 
space and time, this structure can be exploited by splitting the 
global field into many lower-dimensional “light cones”. We review 
light cone decompositions for predictive state reconstruction, 
introducing three simple light cone algorithms. These methods 
allow for tractable inference of spatio-temporal data, such as 
full-frame video. The algorithms make few assumptions on the 
underlying process yet have good predictive performance and 
can provide distributions over spatio-temporal data, enabling 
sophisticated probabilistic inference. 

I. Introduction 

Modeling spatio-temporal data, such as high resolution 
video, is hard. The sheer dimensionality of the data often 
makes global inference methods difficult. Similar curses of
dimensionality for textual and time-series data have been met
with great success by HMMs [Rabiner, 1989], using localized
models for prediction and tractable inference on sequences.
Inspired by this example, we look to localized models for
modeling spatio-temporal data, such as video and fMRI data.
Light cone methods, such as mixed LICORS [Goerg and 
Shalizi, 2013], successfully reduce the global inference task to 
iterating a tractable, localized one. These methods can be used 
for both regression (point predictions of real-valued outputs
from input variables) and computing probability densities. The 
latter property allows one to tractably compute distributions 
over spaces of events, e.g., over the space of all possible 
videos, V*, just as HMMs induce probability distributions over
the set S* of all possible sequences (Figure 1). This ability 
could make light cone decompositions as general and useful 
for modeling spatio-temporal data as HMMs are for textual 
and time-series data. 

The goals of this manuscript are thus: (1) Showing how light 
cone decompositions help make spatio-temporal modeling 
tasks tractable; (2) Introducing three easy-to-implement light 
cone algorithms, allowing others to begin experimenting with 
light cone methods; (3) Assessing the predictive accuracy of 
light cone methods on two video prediction tasks; and (4)
Providing a finite sample guarantee on the error of predictive 
state light cone methods. We begin with some preliminaries. 



Fig. 1. Probability densities over the space of all strings, S*, and the space
of all videos, V*.


II. Notation and Preliminaries 

Given a random field $X(\mathbf{r}, t)$, observed for each point $\mathbf{r}$
on a regular spatial lattice $S$ at discrete time instants
$t = 1, \ldots, T$, we seek to approximate a joint likelihood over the
observations of the spatio-temporal process, and to accurately
forecast the future of the process. Since causal influences in
physical systems only propagate at finite speed (denoted $c$), we
follow Parlitz and Merkwirth [2000] and adopt the concept of
light cones, which are defined as the sets of events that could
influence $(\mathbf{r}, t)$. Formally, a past light cone (PLC) is the set
of all past variables¹ that could have affected $X(\mathbf{r}, t)$:

$$\ell^-(\mathbf{r}, t) := \{ X(\mathbf{u}, s) \mid s < t,\ \|\mathbf{r} - \mathbf{u}\|_2 \le c\,(t - s) \}.$$

Similarly, a future light cone (FLC) is the set of all future 
events that could be affected by (r, t). As a practical matter, 
not all past (or future) events are equally informative, since 
more recent events tend to exert greater causal influence. Thus, 
in practice, we can approximate the true past light cone with a 
much smaller subset light cone, improving tractability without 
incurring much predictive error. 
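
As an illustration of the decomposition, the following minimal sketch extracts (PLC, FLC) pairs from a field stored as a NumPy array. It is not the authors' implementation: the function name `extract_light_cones`, the array layout `(T, H, W)`, and the use of square spatial neighborhoods (instead of the 2-norm ball in the definition above) are assumptions made for simplicity. Here `h_p` is an assumed past horizon (the number of past time steps retained), and each FLC is taken to be the single present value.

```python
import numpy as np

def extract_light_cones(field, h_p=1, c=1):
    """Extract (PLC, FLC) pairs from a field of shape (T, H, W).

    Assumptions (illustrative only): square neighborhoods of half-width
    c * k at lag k, a past horizon of h_p steps, and an FLC consisting of
    the present value alone. Points whose PLC would extend beyond the
    edge of the field are skipped.
    """
    field = np.asarray(field, dtype=float)
    T, H, W = field.shape
    margin = c * h_p  # widest spatial reach of any retained past slice
    plcs, flcs = [], []
    for t in range(h_p, T):
        for i in range(margin, H - margin):
            for j in range(margin, W - margin):
                slices = []
                for k in range(1, h_p + 1):
                    r = c * k  # reach grows linearly with the lag
                    patch = field[t - k, i - r:i + r + 1, j - r:j + r + 1]
                    slices.append(patch.ravel())
                plcs.append(np.concatenate(slices))
                flcs.append(field[t, i, j])
    return np.array(plcs), np.array(flcs)
```

With `h_p = 1` and `c = 1`, each PLC in this sketch is a 3-by-3 patch from the previous frame (9 values), and only interior pixels receive a light cone.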

Furthermore, we adopt the conditional independence assumption
for light cones given in Goerg and Shalizi [2013],
which allows for the factorization of the joint likelihood into
the product of conditional likelihoods. Indexing each $X(\mathbf{r}, t)$
by a single integer $i = 1, \ldots, N$ for simplicity of notation,
the joint pdf of $X_1, \ldots, X_N$ factorizes as

$$p(x_1, \ldots, x_N) \propto \prod_{i=1}^{N} p(x_i \mid \ell_i^-),$$

where the proportionality accounts for incompletely observed
light cones along the edge of the field.

¹ Strictly, we should distinguish the light cone proper, which is a region of
space-time, from the configuration of the random field over this region. We
elide the distinction for brevity.

Given this factorization, it becomes natural to seek equivalence
classes of light cones, i.e., to cluster light
cones into sets based on the similarity of their conditional
distributions. Such equivalence classes of past light cones are
predictive states [Knight, 1975, Goerg and Shalizi, 2012], and
our immediate goals become twofold: first, to discover these
latent predictive states (i.e., learn a mapping $\epsilon$ from PLCs
to predictive states), and second, to estimate the conditional
distribution over $X$ given its predictive state. Goerg and Shalizi
[2012] introduced LICORS as a nonparametric method of
predictive state reconstruction, followed by mixed LICORS
[2013] as a mixture model extension of LICORS, where each
future light cone is forecast using a mixture of extremal
predictive states. Mixed LICORS has predictive advantages
over the original LICORS, but requires finding an $N \times K$
matrix of weights (where $N$ is the number of light cones and
$K$ the number of predictive states) using a form of EM, where
each weight is determined using a kernel density estimate on
all points. Each EM iteration takes $O(N^2 K)$ steps, slowing
mixed LICORS considerably for large $N$. Almost equally
daunting, the original algorithms are quite complex and difficult
to implement and debug, which inhibits their adoption.

III. Contributions 

We review the use of light cones for localized spatio-temporal
prediction. We introduce two simplified nonparametric
methods for the predictive state reconstruction task and
a simple regression light cone method for fast and accurate
forecasting. The first predictive state method, Moonshine, is
a simple meta-algorithm consisting of basic clustering steps
combined with dimensionality reduction and nonparametric
density estimation. Moonshine is instance-based and requires
no iterative likelihood maximization, yet retains many of
the qualities of the more complex mixed LICORS method.
The second predictive state algorithm, One Hundred Proof
(OHP), simplifies the Moonshine approach further and consists
of clustering in the space of future light cones, using
the clusters to obtain state-specific nonparametric density
estimates over the space of PLCs and FLCs. These simple
algorithms are much easier to implement than the LICORS
algorithms, being simplified approximations of the mixed
LICORS system, yet retain many of their forecasting and
modeling strengths.

We further conduct two sets of empirical experiments showing
the predictive power of light cone methods for predicting
video-like data, and report results. Lastly, we give a large
sample theoretical guarantee for light cone predictive state
systems.


The remainder is structured as follows. §IV describes the
Moonshine, One Hundred Proof and light cone linear regression
algorithms. §V describes the experimental setup for
two real-world spatio-temporal prediction tasks, and gives the
results of the algorithms and baselines. §VI discusses these
results. §VII gives an upper
bound on the estimation error of our methods. §VIII reviews
related and future work, while §IX summarizes our findings.

IV. Methods 

Our simple predictive state reconstruction methods build
upon the principles introduced in Goerg and Shalizi [2013]
for mixed LICORS. Both new methods reconstruct a set of
predictive states and a soft mapping $\epsilon$ from past light cones to
states, through use of nonparametric density estimation over
the space of light cones. That is, for all past light cones $\ell^-$
the methods compute

$$\epsilon(\ell^-) = [w_1, w_2, \ldots, w_K]^T,$$

where $w_j$ is the normalized weight of state $S_j$ for light cone
$\ell^-$. Unlike mixed LICORS, the new methods avoid having to
explicitly construct an $N \times K$ matrix, yet retain the benefits
of soft membership mixture modeling.

After describing the reconstruction algorithms, we discuss 
how one can determine the conditional probability density of 
an observation given its past light cone, and how to use this 
conditional density in forecasts. We then describe an additional 
pure regression light cone method, useful for fast and accurate 
forecasting without state reconstruction. Appendix 2 describes 
parameter settings and practical implementation issues that 
arise when using the algorithms. 

A. Moonshine 
Fig. 2. Component stages of the Moonshine algorithm.

Algorithm 1 Moonshine

1: Decompose spatio-temporal process into light cone (PLC, FLC) observation tuples.
2: Cluster PLCs using density-based clustering.
3: Compute cluster-conditioned density estimates for 2K + 1 random points.
4: if number of clusters > maximum number then
5:     Merge clusters in the space of reduced dimension.
6: end if
7: Map original light cones to final clusters.

Moonshine begins by decomposing the random field into
its component light cones, shown at far left in Figure 2.
The algorithm then proceeds through two successive stages
of clustering, separated by a dimension-reduction step. The
main steps of Moonshine are given in Algorithm 1.

The output of the procedure is a set of predictive states, 
each of which consists of a set of PLCs and FLCs. The 
predictive states are used to create a pair of nonparametric 
density estimates, one over PLCs and one over FLCs, which 
jointly identify each state. 

Initial Clustering: For the first clustering step, Moonshine
uses a density-based clustering approach [Ester et al., 1996]
to cluster the light cones in the space of PLCs, which assumes
that similar PLCs have similar predictive consequences.
Such clustering methods need a specified local-neighborhood
size, so we begin with small neighborhoods, progressively
increase until 90% of all points are clustered, and assign the
remaining points to the nearest cluster center (effectively
hybridizing density-based clustering with k-means). This allows
for good coverage while avoiding formation of a single,
all-encompassing cluster. (Alternative clustering algorithms, e.g.,
Zahn [1971], Gokcay and Principe [2002], Zhao et al. [2015],
would also work.)

Density Estimation and Dimensionality Reduction: The
FLCs associated with each cluster (mapped through their
respective light cones) are used to form kernel density estimates
over the space of FLCs. In other words, each cluster consists
of some set of associated FLCs and these FLCs are then used
to estimate densities over the FLC space. We estimate the
densities of $2K + 1$ randomly selected points, where $K$ is a
parameter that affects the degree of dimensionality reduction.
The log-probability ratio is taken between the first point
and the remaining $2K$ points. This vector of log-probability
ratios forms the “signature” of the cluster, following the
construction of a canonical sufficient statistic for exponential
family distributions [Kulhavy, 1996, p. 123].
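
The signature construction can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it assumes one-dimensional FLCs, a hand-written Gaussian kernel density estimate with a fixed bandwidth, and a ratio direction of log p(first point) minus log p(other point). The names `gaussian_kde_logpdf`, `cluster_signature`, `ref_points`, and `bandwidth` are hypothetical, and how the reference points are drawn is left to the caller.

```python
import numpy as np

def gaussian_kde_logpdf(query, samples, bandwidth):
    """Log-density of a 1-D Gaussian KDE (built on `samples`) at each query point."""
    query = np.asarray(query, dtype=float)[:, None]      # shape (Q, 1)
    samples = np.asarray(samples, dtype=float)[None, :]  # shape (1, N)
    z = (query - samples) / bandwidth
    log_kernels = -0.5 * z ** 2 - np.log(bandwidth * np.sqrt(2.0 * np.pi))
    # Numerically stable log of the mean kernel value over the samples.
    m = log_kernels.max(axis=1, keepdims=True)
    log_mean = m + np.log(np.exp(log_kernels - m).mean(axis=1, keepdims=True))
    return log_mean.ravel()

def cluster_signature(cluster_flcs, ref_points, bandwidth=0.5):
    """Signature of one cluster: log-density ratios at 2K + 1 shared points.

    `ref_points` holds the 2K + 1 reference points, shared by all clusters.
    The result has length 2K: the log-density at the first point minus the
    log-density at each remaining point (the direction is an assumption).
    """
    logp = gaussian_kde_logpdf(ref_points, cluster_flcs, bandwidth)
    return logp[0] - logp[1:]
```

Because every cluster is evaluated at the same reference points, the resulting signatures live in a common 2K-dimensional space and can be compared or clustered directly.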

Merging Clusters: If the number of clusters is greater than
the maximum number of predictive states specified for the
model, we cluster again to reduce the number. We cluster
the low-dimensional signature vectors with k-means++ [Arthur
and Vassilvitskii, 2007], to form the final predictive states. The
original light cones are then assigned to the resulting states,
so each predictive state has a unique set of PLCs and FLCs
with which to form nonparametric density estimates over both
the PLC and FLC spaces.

B. One Hundred Proof (OHP) 


Algorithm 2 One Hundred Proof

1: Decompose spatio-temporal process into light cone (PLC, FLC) observation pairs.
2: Cluster FLCs using k-means++ clustering.
3: Map original light cone pairs to final clusters.


OHP simplifies Moonshine, with a single clustering step
and subsequent mapping of light cones to clusters. The main
difference is the space in which the clustering occurs: Moonshine
clusters in the space of PLCs, but OHP clusters in the
space of FLCs. Clustering in FLC space effectively groups
past light cones by their predictive consequences, learning a
geometry of our space where points with similar futures are
“near” each other regardless of differences in their histories.
This results in predictive states with expected near-minimal-variance
future distributions [Arthur and Vassilvitskii, 2007],
such that once we are sure of which state a new PLC maps to,
we are highly certain of what outcome the state will generate.

Fig. 3. The One Hundred Proof algorithm’s input and output. Light cones
are input, clustered using the FLCs, which results in density estimates for
PLCs and FLCs for each state. Densities are drawn as one-dimensional for
simplicity, but are typically multi-dimensional, continuous objects.

To motivate this choice, imagine that all pasts map to some
small set of distinct futures, such as to the letters of a discrete
finite alphabet. Given input past $\ell^-$ we want to estimate a
probability function over output $X$, so one way to do this is
to group all occurrences of future $X = x$, and use that cluster
to estimate the distribution, using Bayes' Theorem, namely,

$$P(X = x \mid \ell^-) \propto P(\ell^- \mid X = x)\, P(X = x).$$

Using nonparametric density estimation over the points observed
with outcome $x$, we can estimate the first quantity on
the right hand side, and taking the normalized number of member
outcomes allows one to estimate the second. This example
can easily extend to continuous quantities, by clustering in the
space of observed future outcomes and substituting predictive
states for the finite alphabet, which is the motivation for the
OHP algorithm.

The two steps of OHP are (Algorithm 2):

1) Cluster FLCs: After decomposing our spatio-temporal
process into light cones, we cluster the FLCs using
k-means++. The number of clusters (which will become
the number of predictive states) is a user-defined parameter.

2) Map Light Cones: We then map the original light cones
to our clusters, and produce our final predictive states,
which consist of unique sets of PLCs and FLCs.
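
The two steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: k-means++ is written out by hand in NumPy (a library implementation would serve equally well), and the function names, the dictionary layout of a state, and the `seed` argument are all assumptions. The sketch also assumes the FLCs contain at least `k` distinct values.

```python
import numpy as np

def kmeans_pp(points, k, n_iter=25, seed=0):
    """Lloyd's algorithm with k-means++ seeding; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # k-means++ seeding: each new center is drawn with probability
    # proportional to its squared distance from the nearest chosen center.
    centers = [points[rng.integers(len(points))]]
    for _ in range(1, k):
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    centers = np.array(centers)

    def assign(ctrs):
        dists = ((points[:, None, :] - ctrs[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters where they are
                centers[j] = points[labels == j].mean(axis=0)
    return assign(centers), centers

def one_hundred_proof(plcs, flcs, n_states, seed=0):
    """Cluster in FLC space, then attach each (PLC, FLC) pair to its state."""
    plcs = np.asarray(plcs, dtype=float)
    flcs = np.asarray(flcs, dtype=float).reshape(len(plcs), -1)
    labels, _ = kmeans_pp(flcs, n_states, seed=seed)
    states = []
    for j in range(n_states):
        members = labels == j
        states.append({
            "plcs": plcs[members],     # used to estimate P(PLC | S_j)
            "flcs": flcs[members],     # used to estimate P(X | S_j)
            "prior": members.mean(),   # N_j / N, an estimate of P(S_j)
        })
    return states
```

Each returned state carries the light cones needed for the two state-conditioned density estimates, together with the fraction of light cones assigned to it.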

As in the case of Moonshine, the FLCs and PLCs for each
state $S_j$ are used to compute nonparametric density estimates
over the space of FLCs and PLCs, providing estimators for
$P(X \mid S_j)$ and $P(\ell^- \mid S_j)$ respectively. Algorithm 2 outlines the
process of state reconstruction for OHP.

C. Predictive Distributions for Light Cone Systems

Given the states reconstructed by Moonshine or OHP, we
can estimate predictive distributions as follows. The conditional
probability (or probability density) of $X$ given PLC $\ell^-$
is obtained by mixing over the predictive states, namely

\begin{align}
P(X \mid \ell^-) &= \sum_{j=1}^{K} P(X, S_j \mid \ell^-) \tag{1} \\
&= \sum_{j=1}^{K} P(X \mid S_j)\, P(S_j \mid \ell^-) \tag{2}
\end{align}

where the second equality follows from the conditional
independence of $X$ and $\ell^-$ given the predictive state $S_j$. The
$P(S_j \mid \ell^-)$ terms serve as the mixture weights, and Bayes'
Theorem yields

\begin{align}
P(S_j \mid \ell^-) &= \frac{P(\ell^- \mid S_j)\, P(S_j)}{P(\ell^-)} \tag{3} \\
&= \frac{P(\ell^- \mid S_j)\, P(S_j)}{\sum_{k=1}^{K} P(\ell^- \mid S_k)\, P(S_k)}. \tag{4}
\end{align}

All of the quantities in (2) and (4) can be estimated using
our reconstructed predictive states, which are each associated
with unique sets of PLCs and FLCs. We estimate $P(S_j)$ by
$N_j / N$, where $N$ is the total number of light cone observations
and $N_j$ is the number of light cones assigned to state $S_j$. The
two state-conditioned densities $P(X \mid S_j)$ and $P(\ell^- \mid S_j)$ are
estimated using nonparametric density estimation techniques
(such as kernel density estimation) based on their associated
FLCs and PLCs. Thus we get

\begin{equation}
\hat{P}(X \mid \ell^-) = \sum_{j=1}^{K} \frac{(N_j / N)\, \hat{P}(\ell^- \mid S_j)}{\sum_{k=1}^{K} (N_k / N)\, \hat{P}(\ell^- \mid S_k)}\, \hat{P}(X \mid S_j) \tag{5}
\end{equation}

where $\hat{P}(X \mid S_j)$ and $\hat{P}(\ell^- \mid S_j)$ denote the nonparametric
density estimates of the two corresponding conditional densities.

When we need a point prediction of $X$, we use the conditional
mean:

\begin{align}
E[X \mid \ell^-] &= E\big[ E[X \mid \ell^-, S] \,\big|\, \ell^- \big] \tag{6} \\
&= E\big[ E[X \mid S] \,\big|\, \ell^- \big] \tag{7} \\
&= \sum_{j=1}^{K} P(S_j \mid \ell^-)\, E[X \mid S_j]. \tag{8}
\end{align}

Replacing $P(S_j \mid \ell^-)$ with (4), plugging in the estimated
densities and probabilities, and using the mean future value for
state $S_j$ (denoted $\bar{X}_j$) to estimate $E[X \mid S_j]$, we obtain the final
prediction rule

\begin{align}
\hat{E}[X \mid \ell^-] &= \sum_{j=1}^{K} \hat{P}(S_j \mid \ell^-)\, \bar{X}_j \tag{9} \\
&= \sum_{j=1}^{K} \frac{(N_j / N)\, \hat{P}(\ell^- \mid S_j)}{\sum_{k=1}^{K} (N_k / N)\, \hat{P}(\ell^- \mid S_k)}\, \bar{X}_j, \tag{10}
\end{align}

which is simply a suitably weighted mixture of the mean
predictions for each state.
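
The mixture weights and the point prediction rule can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: each state is assumed to be a dictionary with keys `"plcs"` (a 2-D array of its PLCs), `"flcs"` (its future values), and `"prior"` (the fraction $N_j / N$), and the state-conditioned PLC density is an isotropic Gaussian kernel density estimate with a hypothetical fixed bandwidth `h`.

```python
import numpy as np

def kde_logpdf(x, samples, h):
    """Log-density at point x of an isotropic Gaussian KDE built on `samples`.

    `samples` must be a 2-D array of shape (n, d); `x` has shape (d,).
    """
    samples = np.asarray(samples, dtype=float)
    d = samples.shape[1]
    sq = ((samples - x) ** 2).sum(axis=1) / h ** 2
    log_k = -0.5 * sq - 0.5 * d * np.log(2.0 * np.pi * h ** 2)
    m = log_k.max()  # stabilize the log-mean-exp
    return m + np.log(np.mean(np.exp(log_k - m)))

def state_weights(plc, states, h=0.5):
    """Estimate P(S_j | PLC) via Bayes' rule: prior times PLC density, normalized."""
    plc = np.asarray(plc, dtype=float)
    log_w = np.array([np.log(s["prior"]) + kde_logpdf(plc, s["plcs"], h)
                      for s in states])
    w = np.exp(log_w - log_w.max())  # subtract the max before exponentiating
    return w / w.sum()

def predict_mean(plc, states, h=0.5):
    """Point prediction: state means mixed by the estimated state weights."""
    w = state_weights(plc, states, h)
    state_means = np.array([np.mean(s["flcs"]) for s in states])
    return float(w @ state_means)
```

Working in log space keeps the weights well defined even when a PLC lies far from every state, where the raw kernel densities would underflow.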


D. Light Cone Linear Regression 

If only predictive regression is needed and not a full generative
model, one can perform linear regression directly using
light cones. Light cone linear regression uses the same light
cone decomposition as the LICORS, Moonshine and OHP
methods, but learns a regression rule directly from past light
cones to future light cone values. This has the advantages of
extremely fast prediction and good forecasting accuracy, along
with simple implementation. We evaluate the performance
of light cone linear regression on two real-world forecasting
tasks, in §V.
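
A minimal sketch of light cone linear regression is given below, assuming PLCs are stacked as rows of a matrix and each FLC is a single scalar. It uses NumPy's ordinary least squares solver as a stand-in for a library regression routine; the function names and the explicit intercept column are illustrative choices, not taken from the original implementation.

```python
import numpy as np

def fit_light_cone_regression(plcs, flcs):
    """Ordinary least squares from PLC vectors to scalar future values.

    Returns a coefficient vector whose first entry is the intercept.
    """
    plcs = np.asarray(plcs, dtype=float)
    design = np.column_stack([np.ones(len(plcs)), plcs])  # prepend intercept
    coef, *_ = np.linalg.lstsq(design, np.asarray(flcs, dtype=float), rcond=None)
    return coef

def predict_light_cone_regression(coef, plcs):
    """Apply the fitted rule to one PLC or to a matrix of PLCs."""
    plcs = np.atleast_2d(np.asarray(plcs, dtype=float))
    return coef[0] + plcs @ coef[1:]
```

Prediction reduces to one dot product per pixel, which is why this approach is fast at test time.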


V. Experimental Setup 

In order to evaluate the effectiveness of light cone methods, 
we attempt spatio-temporal forecasting on real-world data. 

A. Forecasting Task 1: Electrostatic Potentials 

For the first task, the data come from a set of experiments
measuring electrostatic potential changes in organic electronic
materials [Hoffmann et al., 2013].² We learn a common set of
predictive states across experiments, and do frame-by-frame
prediction on a single held-out experiment, effectively
cross-validating across experiments.

Each experiment consists of 7-10 time slices, or frames. 
Each frame is a 256-by-256 matrix of scalar measurements, 
which we call pixels , since the data resembles video in 
structure. Predictions are performed for 254-by-254 pixels in 
each frame after the first, which allows for each pixel to be 
predicted based on a full light cone, thus excluding marginal 
light cones. 

B. Forecasting Task 2: Human Speaker Video 

For the second task, we predict the next frame of a full-resolution
video from a recording of a human speaker, used in
generating an intelligent avatar agent.³ In this task, we perform
leave-one-frame-out predictions, cross-validating across video
frames. Each frame consists of 440-by-330 pixels, of which
predictions are performed on the 428-by-328 interior pixels,
again excluding marginal light cones. Every fifth frame from
the video is retained, and light cones are extracted from
roughly one hundred skip frames. Forty thousand light cones
are subsampled for tractability. These light cones are used for
cross-validation.

C. Comparison Methods and Parameter Settings 

We compare the performance of predictive state reconstruction
and forecasting systems with some simple baseline
methods. For all light cone methods, the same set of light
cones was extracted from the data, with $h_p = 1$, $h_f = 0$, and
$c = 1$, resulting in PLCs of dimension $d = 9$ and FLCs with
dimension $d = 1$. We evaluate the performance of the mixed
LICORS system, implemented by the authors following Goerg
and Shalizi [2013]. For tractability, only twenty thousand light
cones were used in training each fold for the first task, and
forty thousand for the second. Kernel density estimators were
used for both PLC and FLC density estimation, to improve
predictive performance. Initialization
was performed using k-means++ and the iteration delta was
set to 0.0019. For light cone linear regression, we use linear
regression implemented in the scikit-learn package for Python
[Pedregosa et al., 2011], version 0.15.2.

² Specifically, the data were collected using Kelvin probe force microscopy
to measure spatio-temporal changes in electrostatic charge regions on the
surface of poly(3-hexyl)thiophene film.

³ Used with permission from GetAbby (True Image Interactive, LLC).

The simplest method we compare against is the “predict
the value from the last frame” method that simply takes the
previous value of a pixel and uses that as the prediction for the
pixel in the current frame. The k-nearest neighbor regressor
takes as input a past light cone and finds the k-nearest PLCs
in Euclidean space, then takes the weighted average of their
individual future light cone values and outputs that as the
current prediction. Below, we report results from the scikit-learn
implementation of KNeighborsRegressor with default
parameter settings.
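
The two baselines can be sketched in a few lines. This is a sketch rather than the code used for the reported results: the function names are hypothetical, frames are assumed to be stacked in an array of shape `(T, H, W)`, and the nearest-neighbor average is unweighted with `k = 5`, both of which are assumptions of this illustration.

```python
import numpy as np

def future_like_the_past(frames):
    """Predict every frame after the first by copying the preceding frame.

    Returns an array of shape (T - 1, H, W); entry t is the prediction
    for frame t + 1 of the input.
    """
    frames = np.asarray(frames, dtype=float)
    return frames[:-1]

def knn_predict(train_plcs, train_flcs, query_plc, k=5):
    """Average the future values of the k training PLCs nearest to the query."""
    train_plcs = np.asarray(train_plcs, dtype=float)
    train_flcs = np.asarray(train_flcs, dtype=float)
    dists = np.linalg.norm(train_plcs - np.asarray(query_plc, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]  # indices of the k smallest distances
    return float(train_flcs[nearest].mean())
```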

D. Performance Metrics 

We compared performance in terms of mean-squared-error
(MSE) and correlation (Pearson $\rho$) with the ground truth.
Additionally, for the three distributional methods (mixed
LICORS, Moonshine and OHP) we measured the average
per pixel log-likelihood (Avg. LL) of the predictions, an
estimate of the (negative) cross-entropy between the model
and the truth, and the perplexity ($2^{-\text{Avg. LL}}$), with lower
perplexity being better. For the distributional methods, we tested
performance both for a large maximum number of states
($K_{\max} = 100$) and a small number of states ($K_{\max} = 10$).

To avoid negative infinities appearing when model likelihoods
are sufficiently close to zero, we apply smoothing to
the three distributional models for all likelihood estimates
mapping to zero, converting them to likelihoods of $10^{-300}$.
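
The likelihood-based metrics can be computed as in the following sketch. Base-2 logarithms are assumed here, consistent with the definition of perplexity as $2^{-\text{Avg. LL}}$; the function names and the `floor` argument (which plays the role of the smoothing constant) are illustrative.

```python
import math

def average_log_likelihood(likelihoods, floor=1e-300):
    """Mean base-2 log-likelihood, with zero likelihoods raised to `floor`."""
    logs = [math.log2(max(p, floor)) for p in likelihoods]
    return sum(logs) / len(logs)

def perplexity(avg_ll):
    """Perplexity corresponding to an average base-2 log-likelihood."""
    return 2.0 ** (-avg_ll)
```

For example, an average log-likelihood of -0.672 corresponds to a perplexity of about 1.593.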

E. Qualitative Results 

Light cone systems compare favorably to state-of-the-art 
deep learning methods, such as Mathieu et al. [2015] (seen 
in Figure 4), which improves on earlier work by Ranzato 
et al. [2014]. The amount of blurring and structural aberration 
becomes noticeable in their prediction examples, reproduced 
here. Compare with Figure 5, where a light cone system 
(mixed LICORS) is used to predict the next frame of human 
video. The light cone predictions maintain strong structural 
consistency and minimal blurring, at the cost of some quantization
effects (due to predictive state clustering).

Fig. 4. Prediction examples of Mathieu et al. [2015].

Fig. 5. Predicting video with mixed LICORS light cone system.

Fig. 6. Predicting electrostatic potentials with Moonshine.

Fig. 7. Predicting electrostatic potentials with OHP.

For the electrostatic potentials prediction task, Figs. 6 and
7 show three frames of predictions each for Moonshine and
OHP, respectively. The next frame (top to bottom) is predicted
using models trained on the remaining six experiments, given
PLCs from the previous frame. Error percentage was calculated
as a proportion of the maximum dynamic range of the
actual values or predictions, namely,

$$\mathrm{err}_{\mathrm{pct}} = \frac{|t - p|}{\left| \max\{v : v \in T \cup P\} - \min\{v : v \in T \cup P\} \right|},$$

where $T$ is the set of true testing frames, $P$ is the set
of predicted frames, $t$ is the true value at a pixel, $p$ is
the predicted value of a pixel and $|\cdot|$ is the $L_1$ norm.
Qualitatively, both methods do well, capturing much of the
changing dynamics in each frame. The methods have trouble
representing the extreme values at the two “hotspots” (visible
in the error plots in the third columns), giving instead
over-smoothed predictions. Other than those extreme regions, the
error residuals lack obvious structure and are relatively small.
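
The error percentage used in the error plots can be computed per pixel as in the sketch below. The function name and the assumption that true and predicted frames are stored as equally shaped arrays are illustrative.

```python
import numpy as np

def error_percentage(true_frames, pred_frames):
    """Per-pixel absolute error as a fraction of the joint dynamic range.

    The range is taken over the true and predicted values together.
    """
    t = np.asarray(true_frames, dtype=float)
    p = np.asarray(pred_frames, dtype=float)
    hi = max(t.max(), p.max())
    lo = min(t.min(), p.min())
    return np.abs(t - p) / abs(hi - lo)
```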

F. Quantitative Results 

Table I shows how well each method did at predicting
electrostatic potentials (Task 1). Mixed LICORS and Moonshine
have the lowest MSE, with 95% confidence intervals disjoint
from the intervals of other methods. Mixed LICORS also has
the highest (Pearson) correlation with the true values. Lastly, of
the generative methods (i.e., mixed LICORS, Moonshine and
One Hundred Proof), Moonshine and OHP have the highest
average log-likelihood and lowest perplexity. Thus, mixed
LICORS and Moonshine provide the best overall performance
on the dataset.

Restricting ourselves to the generative methods for a compact
number of states ($K_{\max} = 10$), mixed LICORS has the
lowest average MSE, while Moonshine and One Hundred
Proof have the best probabilistic performance, giving the
highest likelihoods and lowest perplexities for the data.

Table II gives the results from video prediction (Task 2).
Light cone linear regression has the strongest overall
performance, with low error and high correlation to the ground
truth. However, the strong temporal consistency of this dataset
allows even the Future-like-the-Past (FLTP) method to perform
remarkably well, outperforming the predictive state light cone
methods. While
forecasting is relatively easy for this task, being able to
estimate a likelihood model for such data gives the predictive
state methods an edge over pure regression methods.

VI. Discussion 

In this manuscript, we have tested an existing light cone
method (mixed LICORS), qualitatively comparing it to deep
learning methods, and introduced three new light cone methods
(light cone linear regression, Moonshine, OHP). The two
latter predictive state methods are successive approximations
of the approach used by mixed LICORS, with OHP pushing
the limit of how simplified we could make the approximation.
OHP is demonstrated to be one approximation too far, since
the increased simplification comes at the cost of degraded
performance.

On the first real-world spatio-temporal regression task, we
find that the three LICORS-inspired methods (mixed LICORS,
Moonshine and One Hundred Proof) are able to accurately
forecast the changing dynamics of the underlying spatio-temporal
system. Furthermore, being generative methods, they
can be used to compute the likelihood of spatio-temporal data.
Moonshine and One Hundred Proof (OHP) are conceptually
simple, easy to implement alternatives to the full mixed
LICORS system, which give comparable performance for
likelihood estimation and forecasting on this task. Although
OHP is the simplest method, it fails to perform well in
some contexts, such as the second video prediction task,
showing a trade-off between method simplicity and forecasting
performance.


Light cone linear regression is a fast and simple method, and 
is able to perform well on both prediction tasks. It does not 
estimate likelihoods over data as do the other predictive-state 
methods, but moving to generalized linear models would allow 
this. It shows the effectiveness of light cone decompositions 
and remains a useful approach. 

Overall, the best performance on all tasks was achieved or
shared by the three new methods, with Moonshine having
the best probabilistic modeling performance on both tasks,
light cone linear regression having the best forecasting
performance on the second task, and OHP having good modeling
performance under the constrained setting of a limited number
of states. Moonshine has better probabilistic modeling
performance than mixed LICORS on these tasks, and has statistically
indistinguishable forecasting capability (see Tables I (100 state
case) and II). While it might be argued that the improvement
is modest, we have to remind ourselves that these methods
are approximations; that they improve
performance at all is surprising.

Although OHP has limited forecasting ability, it does
manage to model at least one of the datasets well, showing
that its simplified form is not entirely without merit. This, at
the very least, shows when approximations become too simplified
to accomplish complex tasks. Negative results are important,
especially when detecting boundaries.


VII. Theoretical Results 

We state a result for light cone predictive state systems, with
proof given in Appendix I.

We wish to bound the error of our estimated distribution
over futures given pasts, namely, the error of $\hat{P}(X \mid \ell^-)$. For
a fixed random sample of data, let $P^*(X \mid \ell^-)$ denote the
optimal estimate for $P(X \mid \ell^-)$ constructable from the sample.
We begin by noting

$$|\hat{P}(X \mid \ell^-) - P(X \mid \ell^-)| \le |\hat{P}(X \mid \ell^-) - P^*(X \mid \ell^-)| + |P^*(X \mid \ell^-) - P(X \mid \ell^-)|.$$

The second summand on the right-hand side is the gap
between the optimal estimate and truth, which we assume to
shrink in probability with the sample size (as in Goerg and
Shalizi 2012). We focus on the first term, which is the gap between
our light-cone based nonparametric estimator and the optimal
estimate. For this quantity, we state our main result:


Theorem 1. For a fixed data sample of size $N$, let $P^*(X \mid \ell^-)$
denote the optimal estimator based on that sample and
$\hat{P}(X \mid \ell^-)$ be the light cone estimator based on the same
sample. Let $\hat{P}(X \mid S_j)$ be bounded by a constant $M$ for all
$x$ and $j$. If

$$|\hat{P}(S_j \mid \ell^-) - P^*(S_j \mid \ell^-)| \to 0$$

for all $j$, then for any $X$, $\epsilon > 0$, $\delta > 0$, and sufficiently large
$N$,

$$P\left( |\hat{P}(X \mid \ell^-) - P^*(X \mid \ell^-)| > \epsilon \right) \le 2 \exp\left( \frac{[\ldots]}{N\, K_h(0)^2} \right),$$

where $N^*$ is the (smallest) sum of weights for the predictive
states and $K_h(\cdot)$ is a bandwidth $h$ kernel.



TABLE I

Results for predicting electrostatic potentials.

| Method | K_max | MSE | 95% CI | Pearson ρ | 95% CI | Avg. LL | 95% CI | Perplexity |
|---|---|---|---|---|---|---|---|---|
| Future-like-the-Past | | 0.778 | [0.777, 0.780] | 0.615 | [0.614, 0.616] | | | |
| KNN Regression | | 0.852 | [0.851, 0.853] | 0.506 | [0.505, 0.506] | | | |
| Light Cone Linear Regression | | 0.607 | [0.606, 0.608] | 0.628 | [0.627, 0.628] | | | |
| Mixed LICORS | 100 | 0.569 | [0.567, 0.571] | 0.663 | [0.661, 0.665] | -1.034 | [-1.110, -0.964] | 2.052 |
| Moonshine | 100 | 0.570 | [0.569, 0.572] | 0.656 | [0.655, 0.657] | -0.672 | [-0.727, -0.617] | 1.593 |
| One Hundred Proof | 100 | 0.592 | [0.591, 0.593] | 0.641 | [0.640, 0.642] | -1.724 | [-2.127, -1.321] | 3.303 |
| Mixed LICORS | 10 | 0.566 | [0.565, 0.567] | 0.668 | [0.667, 0.669] | -1.022 | [-1.096, -0.947] | 2.030 |
| Moonshine | 10 | 0.609 | [0.605, 0.613] | 0.625 | [0.622, 0.628] | -0.722 | [-0.767, -0.678] | 1.650 |
| One Hundred Proof | 10 | 0.597 | [0.595, 0.598] | 0.648 | [0.646, 0.649] | -0.682 | [-0.757, -0.608] | 1.605 |





TABLE II

Results for predicting video of human speakers.

| Method | K_max | MSE | 95% CI | Pearson ρ | 95% CI | Avg. LL | 95% CI | Perplexity |
|---|---|---|---|---|---|---|---|---|
| Future-like-the-Past | | 0.031 | [0.031, 0.031] | 0.984 | [0.984, 0.984] | | | |
| KNN Regression | | 0.033 | [0.033, 0.033] | 0.984 | [0.984, 0.984] | | | |
| Light Cone Linear Regression | | 0.028 | [0.028, 0.028] | 0.986 | [0.986, 0.986] | | | |
| Mixed LICORS | 100 | 0.038 | [0.038, 0.038] | 0.981 | [0.981, 0.981] | 0.102 | [0.099, 0.105] | 0.932 |
| Moonshine | 100 | 0.039 | [0.039, 0.039] | 0.981 | [0.981, 0.981] | 0.925 | [0.874, 0.976] | 0.527 |
| One Hundred Proof | 100 | 1.060 | [0.460, 1.659] | 0.911 | [0.871, 0.952] | -6.48 | [-8.025, -4.948] | 89.641 |


Proof sketch (see appendix for details): For the quantity
$|\hat{P}(X \mid \ell^-) - P^*(X \mid \ell^-)|$, we first mix over states, and
use the chain rule to condition. Then we add and subtract
$\hat{P}(X \mid S_j)\, P^*(S_j \mid \ell^-)$ and split the sum into two parts, one
multiplied by $P^*(S_j \mid \ell^-)$ and the other multiplied by $\hat{P}(X \mid S_j)$.
By the assumptions stated, the second sum is bounded and
decreasing to zero, so that for sufficiently large $N$ it is smaller
than any $\delta > 0$. The first sum is less than
$\max_j |\hat{P}(X \mid S_j) - P^*(X \mid S_j)|$, which we bound with high probability, using a
Hoeffding bound for dependent data [van de Geer, 2002].
The result follows directly from application of the Hoeffding
bound.

VIII. Related and Future Work 
A. Related Work 

Our debt to Goerg and Shalizi [2012, 2013] needs no
elaboration. We share the same general framework, but aim
at simpler algorithms, even if it costs some predictive power.
The work on LICORS grows out of earlier work on predictive
Markovian representations of non-Markovian time series
[Knight, 1975, Crutchfield and Young, 1989, Shalizi and
Crutchfield, 2001, Shalizi and Klinkner, 2004], whose transfer
to spatio-temporal data was originally aimed at unsupervised
pattern analysis in natural systems [Shalizi et al., 2004, 2006];
our qualitative results suggest Moonshine and OHP remain
suitable for this, as well as for prediction. The formalism
used in this line of work is mathematically equivalent to the
“predictive representations of state” introduced by Littman
et al. [2002], and lately the focus of much interest in conjunction
with spectral estimation methods [Boots and Gordon,
2011]. Both formalisms are also equivalent to observable
operator models [Jaeger, 2000] and to “sufficient posterior”
representations [Langford et al., 2009]; our approach may
suggest new estimation algorithms within those formalisms.

B. Future Work 


Fig. 8. Color prediction of human film data, using mixed LICORS light cone
forecasting. Panels: Actual, Predicted, Error.

Light cone methods, such as the three described here, hold 
promise for the prediction of dynamical systems. Given the 
flexibility and generality of light cone decompositions, one 
can easily extend such methods to handle full-color video (e.g., 
Figure 8), and Kinect™-sensor depth video. These applications 
are the focus of current and future research. 

The “rate limiting step” for approximate light cone methods
like Moonshine and OHP is the speed of nonparametric
density estimation. Methods that scale poorly in the number
of observations are of limited use. To that end, future
research into fast approximate nonparametric density estimation
will improve the computational efficiency of the methods
presented.

The theoretical properties of the two predictive state methods
will be further explored in a future paper, especially with
regard to the trade-offs in their approximation to what LICORS
or mixed LICORS would do, and the influence of the new
algorithms’ internal randomness.

IX. Conclusion 

Faced with the task of learning to accurately model video-like
data, we explore the strengths and drawbacks of light cone
decomposition methods and propose new simplified nonparametric
predictive state methods inspired by the mixed LICORS
[Goerg and Shalizi, 2013] algorithm. The methods, Moonshine
and One Hundred Proof, do not require costly iterative EM
training or the memory-intensive formation of an explicit
N × K matrix, yet they retain the generative modeling capabilities
of the original mixed LICORS method and are competitive
with it in predictive performance. The methods are shown to perform
well on one real-world data task, effectively capturing
spatio-temporal structure and outperforming baseline methods, while
a light cone version of linear regression performs well on the
remaining task. Overall, we see that light cone decompositions
of complex spatio-temporal data can open opportunities to
tractably estimate probability densities and accurately forecast
changing systems. By introducing simplified versions of
light cone algorithms, we hope to encourage further exploration
and application of this general technique.
Appendix I: Proofs


Lemma 1. Let $f_j^*(\cdot)$ denote the density for state $j$ under the true assignment matrix $W^*$ and let $N^* = \sum_i w_{ij}$. Given an
isolated change of $\epsilon$ in the weight $w_{ij}$, the difference between the density estimate $\hat{f}_j(\cdot)$ and $f_j^*(\cdot)$ is bounded by
Furthermore, we can bound this quantity by
Lemma 2. Let $\hat{f}_j$, $f_j^*$, $N^*$ be defined as in Lemma 1. Given a fixed data sample of size $N$, for all $x$, $a > 0$ and $c > 0$ we have

$$P\left(|\hat{f}_j(x) - f_j^*(x)| > a\right) \le 2\exp\left\{-\frac{2(1+N^*)^2 a^2}{\,\cdots\,}\right\}.$$
Proof. Once the sample is fixed, $f_j^*(\cdot)$ becomes a deterministic function of the sample, and $N^*$ becomes a deterministic
constant. Following van de Geer [2002], we define
where $f_{ij}(x) - f_j^*(x)$ denotes that the two functions only differ at the $i$th matrix entry, $L_i$ and $U_i$ are constant (degenerate)
random variables for a fixed sample and
Then, for all $x$, $a > 0$ and $c > 0$, we have

$P(|\hat{f}_j(x) - f_j^*(x)| > a,\ C_n^2 \le c^2 \text{ for some } n) \le P(\hat{f}_j(x) - f_j^*(x) > a,\ C_n^2 \le c^2 \text{ for some } n) +$
Given a fixed sample of size $N$, choose $c_0$ such that $C_n^2 \le c_0^2$ for all $1 \le n \le N$. Then
$\le P(|\hat{f}_j(x) - f_j^*(x)| > a,\ C_n^2 \le c_0^2 \text{ for some } n)$
$\le 2\exp\left\{-\frac{2a^2}{c_0^2}\right\}.$

Because $C_n^2 \le c_0^2$ for all $1 \le n \le N$, we have
Having already established that


c% < N 0)) , we set c 0 = 
Theorem 1. For a fixed data sample of size $N$, let $P^*(X \mid \ell^-)$ denote the optimal estimator based on that sample and
$P(X \mid \ell^-)$ be the light cone estimator based on the same sample. Let $P(X \mid S_j)$ be bounded by a constant $M$ for all $X, j$. If
$|P(S_j \mid \ell^-) - P^*(S_j \mid \ell^-)| \to 0$ for all $j$, then for any $X$, $\epsilon > 0$, $\delta > 0$, and sufficiently large $N$,
where $N^*$ is the (smallest) sum of weights for the predictive states, and $K_h(\cdot)$ is a kernel of bandwidth $h$.
Therefore,
For sufficiently large $N$, $P(A > \epsilon - B,\ B > \delta) = 0$ and $P(B < \delta) = 1$, given that $P(X \mid S_j)$ is bounded and
$|P(S_j \mid \ell^-) - P^*(S_j \mid \ell^-)| \to 0$. Therefore, given $N$ sufficiently large,
where the penultimate inequality follows from Lemma 2.
Appendix II: Implementation Details

We now discuss how to choose various parameter settings
for the two algorithms, as well as some computational techniques
used to improve runtime performance.

Choosing Number of States 

In both mixed LICORS and Moonshine a user must specify
the maximum number of predictive states for the model,
which effectively controls the complexity of the model. In
OHP, one must specify the exact number of predictive states,
since the number is determined by a k-means++ [Arthur and
Vassilvitskii, 2007] clustering step. In all cases, this number
can be chosen based on user preference for simpler models,
or cross-validation may be used to find the number of states
that gives the best predictive performance on held-out data.
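
The held-out selection of the number of states can be sketched as follows. This is an illustration under simplifying assumptions, not the procedure used in the experiments: it uses a single train/held-out split in place of full cross-validation, plain Lloyd's k-means with greedy farthest-point seeding as a deterministic stand-in for k-means++, and a cluster-mean predictor in place of the predictive state models. All function names and the synthetic data are hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centers):
    return min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))


def kmeans(points, k, iters=20):
    """Lloyd's k-means with greedy farthest-point seeding
    (a deterministic stand-in for k-means++, not the same algorithm)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # An empty group keeps its previous center.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers


def heldout_mse(train, held_out, k):
    """Cluster training past cones into k states, predict each held-out future
    with the mean training future of its state, and return the mean squared error."""
    centers = kmeans([past for past, _ in train], k)
    sums, counts = [0.0] * k, [0] * k
    for past, future in train:
        j = nearest(past, centers)
        sums[j] += future
        counts[j] += 1
    overall = sum(future for _, future in train) / len(train)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    errors = [(means[nearest(past, centers)] - future) ** 2 for past, future in held_out]
    return sum(errors) / len(errors)


def choose_num_states(train, held_out, candidates):
    scores = {k: heldout_mse(train, held_out, k) for k in candidates}
    return min(scores, key=scores.get), scores


# Synthetic data: three well-separated regimes, each with its own mean future.
rng = random.Random(0)
regimes = [((0.0, 0.0), 0.0), ((5.0, 5.0), 1.0), ((10.0, 0.0), 2.0)]


def sample(n):
    data = []
    for i in range(n):
        (cx, cy), future = regimes[i % 3]
        past = (cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
        data.append((past, future + rng.gauss(0, 0.1)))
    return data


train, held_out = sample(150), sample(60)
best_k, scores = choose_num_states(train, held_out, candidates=[1, 2, 3, 4])
print(best_k, {k: round(v, 3) for k, v in scores.items()})
```

On data like this the held-out error drops sharply once the number of states reaches the number of regimes and then flattens, so a preference for simpler models would pick the smallest value on the plateau.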


Nonparametric Density Estimation 

Nonparametric density estimation techniques are instance-based
and become slow as the number of observations grows. Our
algorithms use kernel density estimators [Rosenblatt et al.,
1956, Parzen, 1962], for which we retain only a randomly
chosen subsample of five hundred points in each cluster to
compute the densities. The resulting systems still perform well,
as shown in §V, while being computationally tractable.
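
The subsampling step can be sketched as follows. The sketch assumes a one-dimensional Gaussian kernel and a hand-picked bandwidth, neither of which is specified in this section; only the cap of five hundred retained points per cluster comes from the text. The names `subsample` and `gaussian_kde` are illustrative.

```python
import math
import random


def subsample(points, max_points=500, seed=0):
    """Keep at most max_points randomly chosen observations from one cluster."""
    if len(points) <= max_points:
        return list(points)
    return random.Random(seed).sample(points, max_points)


def gaussian_kde(points, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built from points."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)

    return density


# Hypothetical cluster of 5000 observations; the bandwidth 0.3 is an assumption.
rng = random.Random(42)
cluster = [rng.gauss(0.0, 1.0) for _ in range(5000)]
kept = subsample(cluster)
full_density = gaussian_kde(cluster, bandwidth=0.3)
sub_density = gaussian_kde(kept, bandwidth=0.3)
print(len(kept), round(full_density(0.0), 3), round(sub_density(0.0), 3))
```

Each density evaluation costs time proportional to the number of retained points, so capping the subsample bounds the per-query cost regardless of cluster size, at the price of a noisier estimate.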


Dimensionality Reduction Choice in Moonshine 

Another parameter that must be chosen is the degree of
dimensionality reduction when forming distribution signatures
in Moonshine. Data can guide this choice (through cross-validation),
or user preference for more compact models
can guide the choice toward greater degrees of dimensionality
reduction. The fewer the dimensions, the less discriminative
the signatures, and thus the higher the likelihood
of merging clusters.


Density Based Clustering Considerations 

When using density-based clustering such as DBSCAN [Ester
et al., 1996], two issues arise. First, a suitable local neighborhood
size must be chosen (controlled by an ε parameter).
Second, such methods can be computationally expensive and
thus slow. To address the first issue, we take an iterative search
approach: we begin with very small neighborhood sizes, then
increase them until a significant portion of the data is
clustered, while keeping the proportion below 100%. To address
the second issue, we use DBSCAN to cluster only a seed
portion of all observations, then assign the remaining observations
to the nearest cluster centers, which greatly improves runtime.
Controlling the proportion of data used for seeding versus the
portion assigned to cluster centers affects the degree of forced
convexity of the resulting clusters, and also determines the total
runtime of the clustering. Fewer seed points result in faster
clustering, but with more convex-shaped (e.g., k-means-like)
clusters.
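
Both steps, the ε search and the seed-then-assign shortcut, can be sketched with a minimal DBSCAN written from scratch. This is an illustration under stated assumptions rather than the implementation used here: the starting ε, growth factor, 80% target, `min_pts` value, and the strided (every second point) choice of seeds are all hypothetical, and the seeds could equally be drawn at random.

```python
import math


def region(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, with -1 marking noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = region(points, j, eps)
            if len(more) >= min_pts:
                queue.extend(more)
    return labels


def search_eps(points, min_pts, start=0.05, growth=1.5, target=0.8, max_steps=30):
    """Grow eps from a small value until at least `target` of the points are
    clustered; return the last setting that still left some points as noise."""
    eps, best = start, None
    for _ in range(max_steps):
        labels = dbscan(points, eps, min_pts)
        frac = sum(l != -1 for l in labels) / len(labels)
        if frac >= 1.0:
            break
        best = (eps, labels, frac)
        if frac >= target:
            break
        eps *= growth
    return best


def cluster_with_seeds(points, stride=2, min_pts=4):
    """Run DBSCAN on every `stride`-th point, then give each remaining point
    (and each seed left as noise) the label of its nearest cluster center."""
    seed_idx = list(range(0, len(points), stride))
    seeds = [points[i] for i in seed_idx]
    found = search_eps(seeds, min_pts)
    if found is None or max(found[1]) < 0:
        raise ValueError("no clusters found among the seed points")
    eps, seed_labels, _ = found
    k = max(seed_labels) + 1
    centers = []
    for c in range(k):
        members = [p for p, l in zip(seeds, seed_labels) if l == c]
        centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
    for i, l in zip(seed_idx, seed_labels):
        if l != -1:
            labels[i] = l  # seeds keep their DBSCAN label
    return eps, centers, labels


# Two dense 6x6 grids far apart, plus two isolated points.
grid = [(0.2 * c, 0.2 * r) for r in range(6) for c in range(6)]
points = grid + [(x + 10.0, y + 10.0) for x, y in grid] + [(5.0, 20.0), (20.0, 5.0)]
eps, centers, labels = cluster_with_seeds(points)
print(round(eps, 4), len(centers), labels[0], labels[36])
```

Because non-seed points are attached to the nearest center, a larger stride (fewer seeds) pushes the result toward convex, k-means-like clusters, which is the trade-off described above.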


Scaling 

Since Moonshine and OHP cluster based on distances, it
is important to normalize the scaling of all axes and the
dynamic ranges of all experiments. Additionally, if the scale
of the training light cones differs from the scale of the test
light cones, predictive performance will suffer.
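
One way to keep training and test light cones on the same scale is to estimate the normalization from the training data and reuse it unchanged at test time. The sketch below assumes per-axis standardization (zero mean, unit variance), which is one common choice and not necessarily the normalization used in the experiments; the names `fit_scaler` and `apply_scaler` are illustrative.

```python
def fit_scaler(rows):
    """Per-axis mean and standard deviation computed from training light cones only."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant axis has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [
        (sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
        for c, m in zip(cols, means)
    ]
    return means, stds


def apply_scaler(rows, scaler):
    """Rescale rows with statistics fitted elsewhere (here, on the training set)."""
    means, stds = scaler
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical light cones whose two axes have very different dynamic ranges.
train = [[0.0, 100.0], [1.0, 300.0], [2.0, 200.0]]
test = [[1.5, 250.0]]

scaler = fit_scaler(train)
train_scaled = apply_scaler(train, scaler)
test_scaled = apply_scaler(test, scaler)
print(train_scaled, test_scaled)
```

Fitting the scaler separately on the test set would place the two sets in different coordinate systems, so distances to cluster centers learned in training would no longer be meaningful.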



